Abstract: This paper studies bivariate distributions with fixed
marginals from an information-theoretic perspective. In particular,
continuity and related properties of various information measures
(Shannon entropy, conditional entropy, mutual information,
Rényi entropy) on the set of all such distributions are investigated.
The notion of minimum entropy coupling is introduced, and
it is shown that it defines a family of (pseudo)metrics on the
space of all probability distributions in the same way as the
so-called maximal coupling defines the total variation distance.
Some basic properties of these pseudometrics are established, in
particular their relation to the total variation distance, and a
new characterization of the conditional entropy is given. Finally,
some natural optimization problems associated with the above
information measures are identified and shown to be NP-hard.
Their special cases are found to be essentially information-theoretic
restatements of well-known computational problems,
such as the Subset sum and Partition problems.

Index Terms: Coupling, distributions with fixed marginals,
information measures, Rényi entropy, continuity of entropy,
entropy minimization, maximization of mutual information, subset
sum, entropy metric, infinite alphabet, measure of dependence.



I. Introduction 

Distributions with fixed marginals have been studied
extensively in the probability literature (see, for example,
[32] and the references therein). They are closely related to
(and sometimes identified with, as will be the case in this
paper) the concept of coupling, which has proven to be a
very useful proof technique in probability theory [35], and in
particular in the theory of Markov chains [25]. There is also
rich literature on the geometrical and combinatorial properties 
of sets of distributions with given marginals, which are known 
as transportation polytopes in this context (see, e.g., 0). 
We investigate here these objects from a certain information- 
theoretic perspective. Our results and the general outline of 
the paper are briefly described below. 

Section II provides definitions and elementary properties of
the functionals studied subsequently: Shannon entropy, Rényi
entropy, conditional entropy, mutual information, and information
divergence. In Section III we recall the definition and
basic properties of couplings, i.e., bivariate distributions with
fixed marginals, and introduce the corresponding notation. The
notion of minimum entropy coupling, which will be useful in
subsequent analysis, is also introduced here. In Section IV
we discuss in detail continuity and related properties of the
above-mentioned information measures under constraints on
the marginal distributions. These results complement the rich
literature on the topic of extending the statements of information
theory to the case of countably infinite alphabets.

In Section V we define a family of (pseudo)metrics on
the space of probability distributions that is based on the
minimum entropy coupling in the same way as the total
variation distance is based on the so-called maximal coupling.
The relation between these distances is derived from Fano's
inequality. Some other properties of the new metrics are also
discussed, in particular an interesting characterization of the
conditional entropy that they yield.

In Section VI certain optimization problems associated with
the above-mentioned information measures are studied. Most
of them are, in a certain sense, the reverse problems of the
well-known optimization problems, such as the maximum
entropy principle, the channel capacity, and the information
projections. The general problems of (Rényi) entropy minimization,
maximization of mutual information, and maximization
of information divergence are all shown to be intractable.
Since mutual information is a good measure of dependence of
two random variables, this will also lead to a similar result for
all measures of dependence satisfying Rényi's axioms, and to
a statistical scenario where this result might be of interest.
The potential practical relevance of these problems is also
discussed in this section, as well as their theoretical value.
Namely, all of them are found to be basically restatements of
some well-known problems in complexity theory.

II. Information measures 

In this introductory section we recall the definitions and 
elementary properties of some basic information-theoretic 
functionals. All random variables are assumed to be discrete, 
with alphabet N - the set of positive integers, or a subset of 
N of the form {1, . . . ,n}. 

The Shannon entropy of a random variable X with probability
distribution $P = (p_i)$ (we also sometimes write $P(i)$ for the
masses of P) is defined as:

$$H(X) = H(P) = -\sum_i p_i \log p_i \tag{1}$$

with the usual convention $0 \log 0 = 0$ being understood. The
base of the logarithm, $b > 1$, is arbitrary and will not be
specified. H is a strictly concave functional in P [9]. (To avoid
possible confusion: concave means ∩ and convex means ∪.) Further,
for a pair of random variables (X, Y) with joint distribution
$S = (s_{i,j})$ and marginal distributions $P = (p_i)$ and $Q = (q_j)$,
the following defines their joint entropy:

$$H(X,Y) = H_{X,Y}(S) = -\sum_{i,j} s_{i,j} \log s_{i,j}, \tag{2}$$

conditional entropy:

$$H(X|Y) = H_{X|Y}(S) = -\sum_{i,j} s_{i,j} \log \frac{s_{i,j}}{q_j}, \tag{3}$$

and mutual information:

$$I(X;Y) = I_{X;Y}(S) = \sum_{i,j} s_{i,j} \log \frac{s_{i,j}}{p_i q_j}, \tag{4}$$

again with appropriate conventions. We will refer to the above
quantities as the Shannon information measures. They are all
related by simple identities:

$$H(X,Y) = H(X) + H(Y) - I(X;Y) = H(X) + H(Y|X) \tag{5}$$

and obey the following inequalities:

$$\max\{H(X), H(Y)\} \le H(X,Y) \le H(X) + H(Y), \tag{6}$$
$$\min\{H(X), H(Y)\} \ge I(X;Y) \ge 0, \tag{7}$$
$$0 \le H(X|Y) \le H(X). \tag{8}$$

The equalities on the right-hand sides of (6)-(8) are achieved
if and only if X and Y are independent. The equalities on the
left-hand sides of (6) and (7) are achieved if and only if X
deterministically depends on Y (i.e., iff X is a function of Y),
or vice versa. The equality on the left-hand side of (8) holds
if and only if X deterministically depends on Y. We will use
some of these properties in our proofs; for their demonstration
we point the reader to the standard reference [9].

From identities (5) one immediately observes the following:
Over a set of bivariate probability distributions with fixed 
marginals (and hence fixed marginal entropies H(X) and 
H(Y)), all the above functionals differ up to an additive 
constant (and a minus sign in the case of mutual information), 
and hence one can focus on studying only one of them and 
easily translate the results for the others. This fact will also 
be exploited later. 
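
As a small numerical illustration of identities (5) and inequalities (6)-(8), the following sketch computes the Shannon information measures of two couplings of the same marginals. It is only an example under stated assumptions: logarithms are taken to base 2, and the function names `entropy` and `shannon_measures` are hypothetical helpers introduced here, not notation from the paper. The conditional entropies are obtained from the joint entropy via (5).

```python
import math

def entropy(probs):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def shannon_measures(S):
    """Shannon information measures of a joint distribution S (list of rows)."""
    P = [sum(row) for row in S]          # marginal of X (row sums)
    Q = [sum(col) for col in zip(*S)]    # marginal of Y (column sums)
    hx, hy = entropy(P), entropy(Q)
    hxy = entropy([s for row in S for s in row])
    return {
        "H(X)": hx, "H(Y)": hy, "H(X,Y)": hxy,
        "H(X|Y)": hxy - hy,              # identity (5) with X and Y swapped
        "H(Y|X)": hxy - hx,              # identity (5)
        "I(X;Y)": hx + hy - hxy,         # identity (5)
    }

# Two couplings of the same marginals P = Q = (1/2, 1/2).
independent = [[0.25, 0.25], [0.25, 0.25]]
diagonal = [[0.5, 0.0], [0.0, 0.5]]
m_ind = shannon_measures(independent)
m_dia = shannon_measures(diagonal)
```

Both matrices have the same marginal entropies, so their joint entropy, conditional entropies, and mutual information differ only by the additive constants described above.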

Relative entropy (information divergence, Kullback-Leibler
divergence) $D(P\|Q)$ is the following functional:

$$D(P\|Q) = \sum_i p_i \log \frac{p_i}{q_i}. \tag{9}$$



Finally, the Rényi entropy [29] of order $\alpha \ge 0$ of a random
variable X with distribution P is defined as:

$$H_\alpha(X) = H_\alpha(P) = \frac{1}{1-\alpha} \log \sum_i p_i^{\alpha} \tag{10}$$

with

$$H_0(P) = \lim_{\alpha \to 0} H_\alpha(P) = \log |P| \tag{11}$$

where $|P| = |\{i : p_i > 0\}|$ denotes the size of the support of
P, and

$$H_1(P) = \lim_{\alpha \to 1} H_\alpha(P) = H(P). \tag{12}$$
One can also define:

$$H_\infty(P) = \lim_{\alpha \to \infty} H_\alpha(P) = -\log \max_i p_i. \tag{13}$$

The joint Rényi entropy of the pair (X, Y) having distribution
$S = (s_{i,j})$ is naturally defined as:

$$H_\alpha(X,Y) = H_\alpha(S) = \frac{1}{1-\alpha} \log \sum_{i,j} s_{i,j}^{\alpha}. \tag{14}$$

By using subadditivity (for $\alpha < 1$) and superadditivity (for
$\alpha > 1$) properties of the function $x^\alpha$ one concludes that:

$$H_\alpha(X,Y) \ge \max\{H_\alpha(X), H_\alpha(Y)\} \tag{15}$$

with equality if and only if X is a function of Y, or vice versa.
However, the Rényi analogue of the right-hand side of (6) does
not hold unless $\alpha = 0$ or $\alpha = 1$ [1]. In fact, no upper bound
on the joint Rényi entropy in terms of the marginal entropies
can exist for $0 < \alpha < 1$, as will be illustrated in Section IV.

III. Couplings of probability distributions

A coupling of two probability distributions P and Q is a
bivariate distribution S (on the product space, in our case $\mathbb{N}^2$)
with marginals P and Q. This concept can also be defined
for random variables in a similar manner, and it represents a
powerful proof technique in probability theory [35].

Let $\Gamma^{(1)}_n$ and $\Gamma^{(2)}_{n \times m}$ denote the sets of one- and
two-dimensional probability distributions with alphabets of size n
and $n \times m$, respectively:

$$\Gamma^{(1)}_n = \Big\{ (p_i) \in \mathbb{R}^n : p_i \ge 0, \ \sum_i p_i = 1 \Big\} \tag{16}$$
$$\Gamma^{(2)}_{n \times m} = \Big\{ (p_{i,j}) \in \mathbb{R}^{n \times m} : p_{i,j} \ge 0, \ \sum_{i,j} p_{i,j} = 1 \Big\} \tag{17}$$

and let C(P,Q) denote the set of all couplings of $P \in \Gamma^{(1)}_n$
and $Q \in \Gamma^{(1)}_m$:

$$C(P,Q) = \Big\{ S \in \Gamma^{(2)}_{n \times m} : \sum_j s_{i,j} = p_i, \ \sum_i s_{i,j} = q_j \Big\}. \tag{18}$$

It is easy to show that the sets C(P,Q) are convex and closed
in $\Gamma^{(2)}_{n \times m}$. They are also clearly disjoint and cover the entire
$\Gamma^{(2)}_{n \times m}$, i.e., they form a partition of $\Gamma^{(2)}_{n \times m}$. Finally,
they are parallel affine $(n-1)(m-1)$-dimensional subspaces of the
$(n \cdot m - 1)$-dimensional space $\Gamma^{(2)}_{n \times m}$. (We have in mind the
restriction of the corresponding affine spaces in $\mathbb{R}^{n \times m}$ to
$\mathbb{R}_+^{n \times m}$.)

The set of distributions with fixed marginals is basically
the set of matrices with nonnegative entries and prescribed
row and column sums (only now the total sum is required to
be one, but this is inessential). Such sets are special cases of
the so-called transportation polytopes (see the reference given
in Section I).

We shall also find it interesting to study information measures
over the sets of distributions whose one marginal and
the support of the other are fixed:

$$C(P,m) = \bigcup_{Q \in \Gamma^{(1)}_m} C(P,Q). \tag{19}$$

These sets are also convex polytopes and form a partition of
$\Gamma^{(2)}_{n \times m}$, for $P \in \Gamma^{(1)}_n$.
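
The definition of C(P,Q) in (18) and its convexity can be checked mechanically on small examples. The sketch below is illustrative only: the helper names `is_coupling`, `product_coupling`, and `mix` are assumptions made for this example, and distributions are represented as plain Python lists.

```python
def marginals(S):
    """Row sums (first marginal) and column sums (second marginal) of S."""
    return [sum(row) for row in S], [sum(col) for col in zip(*S)]

def is_coupling(S, P, Q, tol=1e-12):
    """Check that S is nonnegative and has marginals P and Q, as in (18)."""
    rows, cols = marginals(S)
    return (all(s >= 0 for row in S for s in row)
            and all(abs(r - p) <= tol for r, p in zip(rows, P))
            and all(abs(c - q) <= tol for c, q in zip(cols, Q)))

def product_coupling(P, Q):
    """Independent coupling P x Q with entries p_i * q_j."""
    return [[p * q for q in Q] for p in P]

def mix(S, T, a):
    """Convex combination a*S + (1-a)*T of two matrices of equal shape."""
    return [[a * s + (1 - a) * t for s, t in zip(rs, rt)]
            for rs, rt in zip(S, T)]

P, Q = [0.5, 0.5], [0.25, 0.75]
S = product_coupling(P, Q)
T = [[0.25, 0.25], [0.0, 0.5]]   # another coupling of P and Q
M = mix(S, T, 0.3)               # stays inside C(P, Q) by convexity
```

Swapping the roles of P and Q in `is_coupling(S, Q, P)` fails here, which reflects the fact that the sets C(P,Q) for different marginals are disjoint.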
A. Minimum entropy couplings

We now introduce one special type of couplings which will
be useful in subsequent analysis.

Definition 1: A minimum entropy coupling of probability
distributions P and Q is a bivariate distribution $S^* \in C(P,Q)$
which minimizes the entropy functional $H \equiv H_{X,Y}$, i.e.,

$$H(S^*) = \inf_{S \in C(P,Q)} H(S). \tag{20}$$

Minimum entropy couplings exist for any $P \in \Gamma^{(1)}_n$ and
$Q \in \Gamma^{(1)}_m$ because the sets C(P,Q) are compact (closed and
bounded) and entropy is continuous over $\Gamma^{(2)}_{n \times m}$ and hence
attains its extrema. (Note, however, that they need not be
unique.) From the strict concavity of entropy one concludes
that the minimum entropy couplings must be vertices of
C(P,Q) (i.e., they cannot be expressed as $aS + (1-a)T$,
with $S, T \in C(P,Q)$, $S \ne T$, $a \in (0,1)$). Finally, from identities
(5) it follows that the minimizers of $H_{X,Y}$ over C(P,Q) are
simultaneously the minimizers of $H_{X|Y}$ and $H_{Y|X}$ and the
maximizers of $I_{X;Y}$, and hence could also be called maximum
mutual information couplings, for example.

Definition 1 (cont.): A minimum $\alpha$-entropy coupling of probability
distributions P and Q is a bivariate distribution $S^* \in C(P,Q)$
which minimizes the Rényi entropy functional $H_\alpha$.

Similarly to the above, existence of the minimum $\alpha$-entropy
couplings is easy to establish, as is the fact that they must be
vertices of C(P,Q) ($H_\alpha$ is concave for $0 < \alpha < 1$; for $\alpha > 1$
it is neither concave nor convex [4], but the claim follows from
the convexity of $\sum_{i,j} s_{i,j}^{\alpha}$).
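
Since minimum entropy couplings are vertices of C(P,Q), it can help to see one vertex explicitly. The sketch below uses the north-west corner rule, a standard construction from the transportation problem that is not taken from this paper; it yields a vertex of the polytope, but not necessarily the minimum entropy coupling. Exact rational arithmetic (`fractions.Fraction`) is an implementation choice made here to avoid rounding issues, and entropies are in bits.

```python
import math
from fractions import Fraction

def entropy(probs):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(float(p) * math.log2(float(p)) for p in probs if p > 0)

def north_west_corner(P, Q):
    """Greedy fill from the top-left corner; returns a coupling of P and Q
    with at most len(P) + len(Q) - 1 nonzero entries."""
    p, q = list(P), list(Q)
    S = [[Fraction(0)] * len(q) for _ in p]
    i = j = 0
    while i < len(p) and j < len(q):
        mass = min(p[i], q[j])
        S[i][j] += mass
        p[i] -= mass
        q[j] -= mass
        # Move past the row if its mass is used up, otherwise past the column.
        if p[i] == 0:
            i += 1
        else:
            j += 1
    return S

P = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
Q = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
vertex = north_west_corner(P, Q)
product = [[a * b for b in Q] for a in P]     # maximum entropy coupling P x Q
h_vertex = entropy(s for row in vertex for s in row)
h_product = entropy(s for row in product for s in row)
```

For these marginals the vertex has four nonzero entries and a strictly smaller joint entropy than the product coupling, in line with the concavity argument above.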

IV. Infinite alphabets 

We now establish some basic properties of information
measures over C(P,Q) and C(P,m), and of the sets C(P,Q)
and C(P,m) themselves, in the case when the distributions P
and Q have possibly infinite supports. The notation is similar
to the finite alphabet case, for example:

$$\Gamma^{(1)} = \Big\{ (p_i)_{i \in \mathbb{N}} : p_i \ge 0, \ \sum_i p_i = 1 \Big\}, \qquad
\Gamma^{(2)} = \Big\{ (p_{i,j})_{i,j \in \mathbb{N}} : p_{i,j} \ge 0, \ \sum_{i,j} p_{i,j} = 1 \Big\}. \tag{21}$$

The following well-known claim will be useful. We give a
proof for completeness.

Lemma 2: Let $f : A \to \mathbb{R}$, with $A \subseteq \mathbb{R}$ closed, be a
continuous nonnegative function. Then the functional
$F(x) = \sum_i f(x_i)$, $x = (x_1, x_2, \ldots)$, is lower semi-continuous in the
$\ell^1$ topology.

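Proof: Let $\|x^{(n)} - x\|_1 \to 0$. Then, by using nonnegativity
and continuity of f, we obtain

$$\liminf_{n \to \infty} F(x^{(n)}) = \liminf_{n \to \infty} \sum_{i=1}^{\infty} f(x_i^{(n)})
\ge \liminf_{n \to \infty} \sum_{i=1}^{K} f(x_i^{(n)}) = \sum_{i=1}^{K} f(x_i), \tag{22}$$

where the fact that $\|x^{(n)} - x\|_1 \to 0$ implies $|x_i^{(n)} - x_i| \to 0$,
$\forall i$, was also used. Letting $K \to \infty$ we get

$$\liminf_{n \to \infty} F(x^{(n)}) \ge F(x), \tag{23}$$

which was to be shown. ■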
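
Lemma 2 gives lower semi-continuity only, and the inequality in (23) can indeed be strict. The following sketch is an illustrative example chosen here (it does not appear in the paper): it takes f(t) = t^(1/2) on A = [0, ∞) and the sequences x^(n) with n entries equal to 1/n^2. These converge to the zero sequence in the ℓ^1 norm, while F(x^(n)) stays equal to 1 and F(0) = 0.

```python
def F(x, alpha=0.5):
    """F(x) = sum_i f(x_i) with f(t) = t**alpha, continuous and nonnegative."""
    return sum(t ** alpha for t in x)

def x_n(n):
    """Sequence with n entries equal to 1/n**2 (all other entries are zero)."""
    return [1.0 / n**2] * n

for n in (10, 100, 1000):
    x = x_n(n)
    l1_distance_to_zero = sum(abs(t) for t in x)   # equals 1/n
    print(n, l1_distance_to_zero, F(x))            # F(x) stays close to 1
```

The same mechanism, many small masses each contributing disproportionately under a concave power, is what drives the discontinuity of Rényi entropy of order below one.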



A. Compactness of C(P,Q) and C(P,m)

Let $\ell^1_2 = \{ (x_{i,j})_{i,j \in \mathbb{N}} : \sum_{i,j} |x_{i,j}| < \infty \}$. This is the
familiar $\ell^1$ space, only defined for two-dimensional sequences.
It clearly shares all the essential properties of $\ell^1$, completeness
being the one we shall exploit. The metric understood is:

$$\|x - y\|_1 = \sum_{i,j} |x_{i,j} - y_{i,j}|, \tag{24}$$

for $x, y \in \ell^1_2$. In the context of probability distributions, this
distance is usually called the total variation distance (actually,
it is twice the total variation distance, see (49)).

Theorem 3: For any $P, Q \in \Gamma^{(1)}$ and $m \in \mathbb{N}$, C(P,Q) and
C(P,m) are compact.

Proof: A metric space is compact if and only if it is
complete and totally bounded [27, Thm 45.1]. These facts are
demonstrated in the following two propositions. ■

Proposition 4: C(P,Q) and C(P,m) are complete metric 
spaces. 

Proof: It is enough to show that C(P,Q) and C(P,m)
are closed in $\ell^1_2$ because closed subsets of complete spaces are
always complete [27]. In other words, it suffices to show that
for any sequence $S_n \in C(P,Q)$ converging to some $S \in \ell^1_2$
(in the sense that $\|S_n - S\|_1 \to 0$), we have $S \in C(P,Q)$.
This is straightforward. If the $S_n$ all have the same marginals (P
and Q), then S must also have these marginals, for otherwise
the distance between $S_n$ and S would be lower bounded by
the distance between the corresponding marginals:

$$\sum_{i,j} |S(i,j) - S_n(i,j)| \ge \sum_i \Big| \sum_j \big( S(i,j) - S_n(i,j) \big) \Big| \tag{25}$$

and hence could not decrease to zero. The case of C(P,m) is
similar. ■
For our next claim, recall that a set E is said to be totally
bounded if it has a finite covering by $\epsilon$-balls, for any $\epsilon > 0$. In
other words, for any $\epsilon > 0$, there exist $x_1, \ldots, x_K \in E$ such
that $E \subseteq \bigcup_k B(x_k, \epsilon)$, where $B(x_k, \epsilon)$ denotes the open ball
around $x_k$ of radius $\epsilon$. The points $x_1, \ldots, x_K$ are then called
an $\epsilon$-net for E.

Proposition 5: C(P,Q) and C(P,m) are totally bounded.

Proof: We prove the statement for C(P,Q); the proof for
C(P,m) is very similar. Let P, Q, and $\epsilon > 0$ be given. We need
to show that there exist distributions $S_1, \ldots, S_K \in C(P,Q)$
such that $C(P,Q) \subseteq \bigcup_k B(S_k, \epsilon)$, and this is done in the
following. There exists N such that $\sum_{i=N+1}^{\infty} p_i < \epsilon/6$ and
$\sum_{j=N+1}^{\infty} q_j < \epsilon/6$. Observe the truncations of the
distributions P and Q, namely $(p_1, \ldots, p_N)$ and $(q_1, \ldots, q_N)$.
Assume that $\sum_{i=1}^{N} p_i \ge \sum_{j=1}^{N} q_j$, and let
$r = \sum_{i=1}^{N} p_i - \sum_{j=1}^{N} q_j$ (otherwise, just interchange P and Q).
Now let $P^{(N)} = (p_1, \ldots, p_N)$ and $Q^{(N,r)} = (q_1, \ldots, q_N, r)$,
and observe $C(P^{(N)}, Q^{(N,r)})$. (Adding r was necessary for
$C(P^{(N)}, Q^{(N,r)})$ to be nonempty.) This set is closed (see
the proof of Proposition 4) and bounded in $\mathbb{R}^{N \times (N+1)}$,
and hence it is compact by the Heine-Borel theorem. This
further implies that it is totally bounded and has an $\epsilon/6$-net,
i.e., there exist $T_1, \ldots, T_K \in C(P^{(N)}, Q^{(N,r)})$ such that
$C(P^{(N)}, Q^{(N,r)}) \subseteq \bigcup_k B(T_k, \epsilon/6)$. Now construct distributions
$S_1, \ldots, S_K \in C(P,Q)$ by "padding" $T_1, \ldots, T_K$. Namely,
take $S_k$ to be any distribution in C(P,Q) which coincides
with $T_k$ on the first $N \times N$ coordinates, for example:

$$S_k(i,j) = \begin{cases}
T_k(i,j), & i, j \le N \\
0, & j \le N, \ i > N \\
T_k(i, N+1) \cdot q_j \big/ \sum_{l=N+1}^{\infty} q_l, & i \le N, \ j > N \\
p_i \cdot q_j \big/ \sum_{l=N+1}^{\infty} q_l, & i, j > N.
\end{cases} \tag{26}$$

Note that $\|T_\ell - S_\ell\|_1 < \epsilon/3$ (where we understand that
$T_\ell(i,j) = 0$ for $i > N$ or $j > N+1$). We prove below
that the $S_k$'s are the desired $\epsilon$-net for C(P,Q), i.e., that any
distribution $S \in C(P,Q)$ is at distance at most $\epsilon$ from some $S_\ell$,
$\ell \in \{1, \ldots, K\}$ ($\|S - S_\ell\|_1 < \epsilon$). Observe some $S \in C(P,Q)$,
and let $S'$ be its $N \times N$ truncation:

$$S'(i,j) = \begin{cases}
S(i,j), & i, j \le N \\
0, & \text{otherwise.}
\end{cases} \tag{27}$$

Note that $S'$ is not a distribution, but that does not affect the
proof. Note also that the marginals of $S'$ are bounded from
above by the marginals of S, namely $q'_j = \sum_i S'(i,j) \le q_j$
and $p'_i = \sum_j S'(i,j) \le p_i$. Finally, we have $\|S - S'\|_1 < \epsilon/3$
because the total mass of S on the coordinates where $i > N$
or $j > N$ is less than $\epsilon/3$. The next step is to create
$S'' \in C(P^{(N)}, Q^{(N,r)})$ by adding masses to $S'$ on the $N \times (N+1)$
rectangle. One way to do this is as follows. Let

$$u_i = \begin{cases}
p_i - p'_i, & i \le N \\
0, & i > N,
\end{cases} \tag{28}$$

$$v_j = \begin{cases}
q_j - q'_j, & j \le N \\
r, & j = N+1 \\
0, & j > N+1,
\end{cases} \tag{29}$$

and let $U = (u_i)$, $V = (v_j)$, and $c = \sum_i u_i = \sum_j v_j$.
Now define $S''$ by:

$$S'' = S' + \frac{1}{c}\, U \times V. \tag{30}$$

It is easy to verify that $S'' \in C(P^{(N)}, Q^{(N,r)})$ and that
$\|S' - S''\|_1 < \epsilon/6$ because the total mass added is

$$\sum_{i=1}^{N} u_i = \sum_{i=1}^{N} \sum_{j=1}^{\infty} S(i,j) - \sum_{i=1}^{N} \sum_{j=1}^{N} S(i,j)
= \sum_{i=1}^{N} \sum_{j=N+1}^{\infty} S(i,j) \le \sum_{j=N+1}^{\infty} q_j < \frac{\epsilon}{6}. \tag{31}$$

Now recall that the $T_k$'s form an $\epsilon/6$-net for $C(P^{(N)}, Q^{(N,r)})$ and
consequently that there exists some $T_\ell$, $\ell \in \{1, \ldots, K\}$, with
$\|S'' - T_\ell\|_1 < \epsilon/6$. To put this all together, write:

$$\|S - S_\ell\|_1 \le \|S - S'\|_1 + \|S' - S''\|_1 + \|S'' - T_\ell\|_1 + \|T_\ell - S_\ell\|_1 < \epsilon, \tag{32}$$

which completes the proof. ■
B. Continuity of Shannon information measures 

Shannon information measures are known to be discontinuous
functionals in general [17], [39]. Imposing certain restrictions
on the marginal distributions and entropies, however,
ensures their continuity.

Theorem 6: Let $P, Q \in \Gamma^{(1)}$ and $m \in \mathbb{N}$, and assume that
Q has finite entropy. Then Shannon information measures are
continuous over C(P,Q) and C(P,m).

Proof: The claim can be established by using [14, Thm
4.3] and exhibiting the cost-stable codes for these statistical
models, but we also give here a more direct proof (which will
then be extended to prove Theorem 8). Write

$$H_Y(S) = I_{X;Y}(S) + H_{Y|X}(S). \tag{33}$$

The functional $H_{Y|X}(S) = \sum_{i,j} s_{i,j} \log \frac{p_i}{s_{i,j}}$ is lower
semi-continuous by Lemma 2. The functional $I_{X;Y}$ is also lower
semi-continuous since

$$I_{X;Y}(S) = D(S \| P \times Q), \tag{34}$$

and information divergence $D(S\|T)$ is known to be jointly
lower semi-continuous in the distributions S and T [36]. But
since the sum of these two functionals is a constant, $H_Y(S) =
H(Q) < \infty$, both of them must be continuous. The continuity
of $H_{X|Y}$ and $H_{X,Y}$ now follows from (5).

Now consider C(P,m). In [11] it is shown that $H(Y|X)$
and $I(X;Y)$ are continuous when the alphabet of Y is finite
and fixed, which is what we have here. And since $H(X) =
H(P)$ is fixed, $H(X|Y)$ and $H(X,Y)$ are also continuous (if
$H(P) = \infty$ then they are infinite over the entire C(P,m), but
we also take this to mean that they are continuous). ■

In fact, since C(P,Q) and C(P,m) are compact, Shannon
information measures are uniformly continuous over these
domains, for any P, Q with finite entropy, and $m \in \mathbb{N}$.

Combining the above results, we obtain the following.

Theorem 7: Let $P, Q \in \Gamma^{(1)}$ and $m \in \mathbb{N}$, and assume that Q
has finite entropy. Then Shannon information measures attain
their extreme values over C(P,Q) and C(P,m).

Proof: The claim follows from Theorems 3 and 6. Every
continuous function attains its infimum and supremum over a
compact set [31, Thm 4.16]. ■

This claim is superfluous for the maximum entropy distribution
because that distribution is easy to find. Namely, $P \times Q = (p_i q_j)$
maximizes entropy over C(P,Q), as is easily seen from (6),
and $P \times U_m$ is the maximizer over C(P,m), where $U_m$ is the
uniform distribution over $\{1, \ldots, m\}$. But it is much harder to
find the minimum entropy distribution, as we will show, and
its existence is not obvious when the alphabets are unbounded.

The argument in the proof of Theorem 6 can easily be
adapted to prove the following more general claim.

Theorem 8: Let $S_n, S \in \Gamma^{(2)}$ be bivariate probability
distributions. If $S_n$ converges to S ($\|S_n - S\|_1 \to 0$) in such a
way that $H_X(S_n) \to H_X(S)$ and $H_Y(S_n) \to H_Y(S)$, and if
at least one of these marginal entropies is finite, then we must
have:

$$H_{X,Y}(S_n) \to H_{X,Y}(S), \qquad I_{X;Y}(S_n) \to I_{X;Y}(S),$$
$$H_{X|Y}(S_n) \to H_{X|Y}(S), \qquad H_{Y|X}(S_n) \to H_{Y|X}(S). \tag{35}$$

Proof: As in the proof of Theorem 6 we observe that
$\liminf_{n \to \infty} H_{Y|X}(S_n) \ge H_{Y|X}(S)$ (by Lemma 2), and that
$\liminf_{n \to \infty} I_{X;Y}(S_n) \ge I_{X;Y}(S)$, which follows from (34)
and the fact that when $S_n \to S$, then also $P_n \times Q_n \to P \times Q$,
where $P_n, Q_n$ and $P, Q$ are the marginals of $S_n$ and S,
respectively. But since $H_Y(S_n) \to H_Y(S)$ by assumption, one
sees from (33) that both of these inequalities must in fact be
equalities. The remaining claims in (35) then follow from (5). ■

The previous claim establishes that if $\|S_n - S\|_1 \to 0$, then
a sufficient condition for the convergence of joint entropy is
the convergence of marginal entropies. It is also necessary, as
the following theorem shows.

Theorem 9: Let $S_n, S$ be bivariate probability distributions
such that $\|S_n - S\|_1 \to 0$ and $H_{X,Y}(S_n) \to H_{X,Y}(S) < \infty$.
Then $H_X(S_n) \to H_X(S)$ and $H_Y(S_n) \to H_Y(S)$, and
consequently, all claims in (35) hold.

Proof: The claim follows from the identity $H_{X,Y}(S_n) =
H_X(S_n) + H_{Y|X}(S_n)$, and the fact that $H_X$ and $H_{Y|X}$ are
both lower semi-continuous. ■

C. (Dis)continuity of Rényi entropy

Rényi entropy $H_\alpha$ is known to be a continuous functional
for $\alpha > 1$ (see, e.g., [24]) and it of course remains continuous
over C(P,Q) and C(P,m). Therefore, it is also bounded and
attains its extrema over these domains. It is, however, in
general discontinuous for $\alpha \in [0,1]$, and its behavior over
C(P,Q) and C(P,m) needs to be examined separately. The
case $\alpha = 1$ (Shannon entropy) has been settled in the previous
subsection, so in the following we assume that $\alpha \in [0,1)$.

Theorem 10: $H_\alpha$ is continuous over C(P,m), for any $\alpha > 0$.
For $\alpha = 0$ it is discontinuous for any $m \ge 2$.

Proof: Let $0 < \alpha < 1$. If $H_\alpha(P) = \infty$, then $H_\alpha(S) =
\infty$ for any $S \in C(P,m)$ and there is nothing to prove, so
assume that $H_\alpha(P) < \infty$. Let $S_n$ be a sequence of bivariate
distributions converging to S, and observe:

$$\sum_{i,j} S_n(i,j)^{\alpha} = \sum_{i=1}^{\infty} \sum_{j=1}^{m} S_n(i,j)^{\alpha}. \tag{36}$$

Since $S_n(i,j) \le P(i)$ and $\sum_{i=1}^{\infty} \sum_{j=1}^{m} P(i)^{\alpha} =
m \sum_{i=1}^{\infty} P(i)^{\alpha} < \infty$ by assumption, it follows from the
Weierstrass criterion [31, Thm 7.10] that the series (36)
converges uniformly (in n) and therefore:

$$\lim_{n \to \infty} \sum_{i,j} S_n(i,j)^{\alpha} = \sum_{i,j} \lim_{n \to \infty} S_n(i,j)^{\alpha} = \sum_{i,j} S(i,j)^{\alpha}, \tag{37}$$

which gives $H_\alpha(S_n) \to H_\alpha(S)$.

As for the case $\alpha = 0$, it is easy to exhibit a sequence $S_n \to S$
such that the supports of $S_n$ strictly contain the support of
S, i.e., $|S_n| > |S|$, implying that $\lim_{n \to \infty} H_0(S_n) > H_0(S)$.
The case $m = 1$ is uninteresting because $C(P,1) = \{P\}$. ■

Unfortunately, continuity over C(P, Q) fails in general, as 
we discuss next. 

Theorem 11: For any $\alpha \in (0,1)$ there exist distributions
P, Q with $H_\alpha(P) < \infty$ and $H_\alpha(Q) < \infty$, such that $H_\alpha$ is
unbounded over C(P,Q).

Proof: Let $P = Q = (p_i)$ and assume that the $p_i$'s
are monotonically nonincreasing. Define $S_n$ with $S_n(i,j) =
\frac{p_n}{n^r} + \epsilon_{i,j}$ for $i, j \in \{1, \ldots, n\}$, where $\epsilon_{i,j} \ge 0$ are chosen to
obtain the correct marginals and $r > 1$, and $S_n(i,j) = p_i \delta_{i,j}$
otherwise, where $\delta_{i,j}$ is the Kronecker delta. Then $S_n \in
C(P,Q)$, and

$$\sum_{i,j} S_n(i,j)^{\alpha} \ge \sum_{i=1}^{n} \sum_{j=1}^{n} \Big( \frac{p_n}{n^r} \Big)^{\alpha} = n^{2 - r\alpha} p_n^{\alpha}. \tag{38}$$

Now, if $p_n$ decreases to zero slowly enough, the previous
expression will tend to $\infty$ when $n \to \infty$ for appropriately
chosen r. For example, let $p_n \sim n^{-\beta}$, $\beta > 1$. Then
whenever $2 - r\alpha - \beta\alpha > 0$, i.e., $r + \beta < 2\alpha^{-1}$, we will
have $\lim_{n \to \infty} H_\alpha(S_n) = \infty$. Furthermore, if $\beta\alpha > 1$, then
$H_\alpha(P) < \infty$. Therefore, for a given $\alpha \in (0,1)$, we have
found distributions P and Q with finite entropy of order $\alpha$,
such that $H_\alpha$ is unbounded over C(P,Q). ■
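
To make the choice of parameters in this proof concrete, the following sketch evaluates the lower bound n^(2 - rα) p_n^α from (38) for one admissible triple. The values α = 0.5, β = 2.5, r = 1.2 are picked here purely for illustration (they satisfy βα > 1, r > 1, and r + β < 2/α), and the normalizing constant of p_n is dropped since it only rescales the bound by a fixed factor.

```python
def lower_bound(n, alpha, beta, r):
    """n**(2 - r*alpha) * p_n**alpha with p_n = n**(-beta), up to a constant."""
    p_n = n ** (-beta)
    return n ** (2 - r * alpha) * p_n ** alpha

# beta*alpha = 1.25 > 1 (finite H_alpha(P)) and r + beta = 3.7 < 2/alpha = 4.
alpha, beta, r = 0.5, 2.5, 1.2
values = [lower_bound(n, alpha, beta, r) for n in (10, 10**3, 10**6)]
```

The bound grows like n^0.15 for these parameters, so the divergence is slow but unbounded, which is all the proof requires.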

It is known that Rényi entropy $H_\alpha$ satisfies $H_\alpha(X,Y) \le
H_\alpha(X) + H_\alpha(Y)$ only for $\alpha = 0$ and $\alpha = 1$. Such an upper
bound does not hold for $\alpha \in (0,1)$, and, in fact, no upper
bound on $H_\alpha(X,Y)$ in terms of $H_\alpha(X)$ and $H_\alpha(Y)$ can exist,
as Theorem 11 shows.

Corollary 12: For any $\alpha \in (0,1)$ there exist distributions P
and Q such that $H_\alpha$ is discontinuous at every point of C(P,Q).

Proof: Let P and Q be such that $H_\alpha$ is unbounded over
C(P,Q). Let S be an arbitrary distribution from C(P,Q).
It is enough to show that $H_\alpha$ remains unbounded in any
neighborhood of S. Let $M > 0$ be an arbitrary number,
and $\epsilon \in (0,1)$. We can find $T \in C(P,Q)$ with $H_\alpha(T)$ as
large as desired, so assume that $\sum_{i,j} t_{i,j}^{\alpha} > M/\epsilon$. Observe the
distribution $(1-\epsilon)S + \epsilon T$. It is in the $2\epsilon$-neighborhood of S since
$\|S - ((1-\epsilon)S + \epsilon T)\|_1 = \epsilon \|S - T\|_1 \le 2\epsilon$. Also, since the
function $x^\alpha$ is concave for $\alpha < 1$, we get:

$$\sum_{i,j} \big( (1-\epsilon) s_{i,j} + \epsilon t_{i,j} \big)^{\alpha}
\ge (1-\epsilon) \sum_{i,j} s_{i,j}^{\alpha} + \epsilon \sum_{i,j} t_{i,j}^{\alpha} > M, \tag{39}$$

which completes the proof. ■
The case of $\alpha = 0$ (Hartley entropy) remains; the proof of
the following result is straightforward.

Theorem 13: $H_0$ is discontinuous over C(P,Q), for any
distributions P and Q with supports of size at least two.

Note that, unlike for the Shannon information measures,
we cannot claim in general that $H_\alpha$ attains its supremum
over C(P,Q), for $\alpha < 1$. However, the infimum is attained, i.e.,
a minimum $\alpha$-entropy coupling always exists, because Rényi
entropy is lower semi-continuous [24], and any such function
must attain its infimum over a compact set [19].

We next prove that, although $H_\alpha$ is discontinuous for some
P and Q, the continuity still holds for a wide class of marginal
distributions.

Theorem 14: If $\sum_{i,j} \min\{p_i, q_j\}^{\alpha} < \infty$, then $H_\alpha$ is
continuous over C(P,Q), for any $\alpha > 0$. For $P = Q = (p_i)$, with
$p_i$'s nonincreasing, this condition reduces to $\sum_i i \cdot p_i^{\alpha} < \infty$.

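Proof: Let $S_n \to S$, where $S_n, S \in C(P,Q)$. Since,
over C(P,Q), $S_n(i,j) \le \min\{p_i, q_j\}$ and by assumption
$\sum_{i,j} \min\{p_i, q_j\}^{\alpha} < \infty$, we can apply the Weierstrass
criterion to conclude that $\sum_{i,j} S_n(i,j)^{\alpha}$ converges uniformly in
n and therefore that $H_\alpha(S_n) \to H_\alpha(S)$.

Now let $P = Q$ and assume that the $p_i$'s are monotonically
nonincreasing. Then $\min\{p_i, p_j\} = p_{\max\{i,j\}}$, i.e.,

$$\big( \min\{p_i, p_j\} \big) = \begin{pmatrix}
p_1 & p_2 & p_3 & \cdots \\
p_2 & p_2 & p_3 & \cdots \\
p_3 & p_3 & p_3 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}. \tag{40}$$

By observing the elements above (and including) the diagonal,
it follows that:

$$\sum_i i \cdot p_i^{\alpha} \le \sum_{i,j} \min\{p_i, p_j\}^{\alpha} \le 2 \sum_i i \cdot p_i^{\alpha}, \tag{41}$$

and hence the condition $\sum_i i \cdot p_i^{\alpha} < \infty$ is equivalent to
$\sum_{i,j} \min\{p_i, p_j\}^{\alpha} < \infty$. ■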
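
The two-sided bound relating the double sum of min{p_i, p_j}^α to the weighted sum of i · p_i^α can be checked numerically on a finite truncation. In the sketch below, the sequence p_k = k^(-3), the truncation length 200, and α = 0.5 are arbitrary illustrative choices, and the sequence is left unnormalized because normalization rescales both sums by the same factor.

```python
def double_sum(p, alpha):
    """Sum over i, j of min(p_i, p_j)**alpha for a finite nonincreasing p."""
    return sum(min(a, b) ** alpha for a in p for b in p)

def weighted_sum(p, alpha):
    """Sum over i of i * p_i**alpha, with indices starting from 1."""
    return sum(i * a ** alpha for i, a in enumerate(p, start=1))

alpha = 0.5
p = [k ** -3.0 for k in range(1, 201)]   # nonincreasing, unnormalized
d = double_sum(p, alpha)
w = weighted_sum(p, alpha)
diag = sum(a ** alpha for a in p)        # diagonal is counted once, not twice
```

Counting the matrix of minima row by row shows that the double sum equals twice the weighted sum minus the diagonal contribution, which is why the two conditions are equivalent.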

Finally, let us prove a result analogous to Theorem 9.

Theorem 15: Let $S_n, S$ be bivariate probability distributions
such that $\|S_n - S\|_1 \to 0$ and $H_\alpha(S_n) \to H_\alpha(S) < \infty$.
Let $P_n, Q_n$ be the marginals of $S_n$, and $P, Q$ the marginals
of S. Then $H_\alpha(P_n) \to H_\alpha(P)$ and $H_\alpha(Q_n) \to H_\alpha(Q)$.

Proof: If $\|S_n - S\|_1 \to 0$, then of course $\|P_n - P\|_1 \to 0$
and $\|Q_n - Q\|_1 \to 0$. Write:

$$\sum_{i,j} S_n(i,j)^{\alpha} = \sum_i P_n(i)^{\alpha} + \sum_i \Big( \sum_j S_n(i,j)^{\alpha} - P_n(i)^{\alpha} \Big). \tag{42}$$

We are interested in showing that the first term on the right-hand
side converges to $\sum_i P(i)^{\alpha}$, which is equivalent to
saying that $H_\alpha(P_n) \to H_\alpha(P)$. Observe that this term is lower
semi-continuous by Lemma 2, meaning that

$$\liminf_{n \to \infty} \sum_i P_n(i)^{\alpha} \ge \sum_i P(i)^{\alpha}. \tag{43}$$

The second term on the right-hand side of (42) is also lower
semi-continuous for the same reason, namely:

$$\sum_j S_n(i,j)^{\alpha} - P_n(i)^{\alpha} \ge 0 \tag{44}$$

because the function $x^\alpha$ is subadditive, and

$$\liminf_{n \to \infty} \Big( \sum_j S_n(i,j)^{\alpha} - P_n(i)^{\alpha} \Big) \ge \sum_j S(i,j)^{\alpha} - P(i)^{\alpha}. \tag{45}$$

Therefore,

$$\liminf_{n \to \infty} \sum_i \Big( \sum_j S_n(i,j)^{\alpha} - P_n(i)^{\alpha} \Big) \ge \sum_i \Big( \sum_j S(i,j)^{\alpha} - P(i)^{\alpha} \Big), \tag{46}$$

or, since $\sum_{i,j} S_n(i,j)^{\alpha} \to \sum_{i,j} S(i,j)^{\alpha}$ because
$H_\alpha(S_n) \to H_\alpha(S)$,

$$\limsup_{n \to \infty} \sum_i P_n(i)^{\alpha} \le \sum_i P(i)^{\alpha}. \tag{47}$$

Now (43) and (47) give $H_\alpha(P_n) \to H_\alpha(P)$, and $H_\alpha(Q_n) \to
H_\alpha(Q)$ follows by symmetry. ■

Note that the opposite implication does not hold for any
$\alpha \in [0,1)$, as Corollary 12 shows. Namely, if $\|S_n - S\|_1 \to 0$,
convergence of the marginal entropies ($H_\alpha(P_n) \to H_\alpha(P)$
and $H_\alpha(Q_n) \to H_\alpha(Q)$) does not imply convergence of the
joint entropy $H_\alpha(S_n) \to H_\alpha(S)$.

V. Entropy metrics 

Apart from many of their other uses, couplings are very 
convenient for defining metrics on the space of probability 
distributions. There are many interesting metrics defined via 
so-called "optimal" couplings. We first illustrate this point 
using one familiar example, and then define new information- 
theoretic metrics based on the minimum entropy coupling. 

Given two probability distributions P and Q, one could
measure the "distance" between them as follows. Consider
all possible random pairs (X, Y) with marginal distributions
P and Q. Then define some measure of dissimilarity of X
and Y, for example $\mathbb{P}(X \ne Y)$, and minimize it over all such
couplings (minimization is necessary for the triangle inequality
to hold). Indeed, this example yields the well-known total
variation distance

$$d_V(P,Q) = \inf_{C(P,Q)} \mathbb{P}(X \ne Y), \tag{48}$$

where the infimum is taken over all joint distributions of the
random vector (X, Y) with marginals P and Q. Notice that a
minimizing distribution (called a maximal coupling, see, e.g.,
[33]) in (48) is "easy" to find because $\mathbb{P}(X \ne Y)$ is a linear
functional in the joint distribution of (X, Y). For the same
reason, $d_V(P,Q)$ is easy to compute, but this is also clear
from the identity [25]:

$$d_V(P,Q) = \frac{1}{2} \sum_i |p_i - q_i|. \tag{49}$$
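
The agreement between the coupling definition (48) and the closed form (49) can be illustrated with the standard maximal coupling construction: put mass min(p_i, q_i) on the diagonal and distribute the remaining mass as a product of the excesses. This construction is a textbook one rather than something spelled out in the text, the function names are hypothetical, and both distributions are assumed to live on the same finite alphabet.

```python
def total_variation(P, Q):
    """Closed form (49): half the l1 distance between P and Q."""
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def maximal_coupling(P, Q):
    """Coupling of P and Q that keeps X = Y with the largest possible mass."""
    n = len(P)
    common = [min(p, q) for p, q in zip(P, Q)]
    d = 1.0 - sum(common)                 # leftover mass, equals d_V(P, Q)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        S[i][i] = common[i]
    if d > 0:
        for i in range(n):
            for j in range(n):
                # Excesses never overlap, so nothing is added on the diagonal.
                S[i][j] += (P[i] - common[i]) * (Q[j] - common[j]) / d
    return S

def prob_not_equal(S):
    """P(X != Y): total off-diagonal mass of the joint distribution S."""
    return sum(s for i, row in enumerate(S) for j, s in enumerate(row) if i != j)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
S = maximal_coupling(P, Q)
```

For this pair the off-diagonal mass of S equals 0.3, which matches the value given by (49).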



Now let us define some information-theoretic distances in
a similar manner. Let (X, Y) be a random pair with joint
distribution S and marginal distributions P and Q. The total
information contained in these random variables is $H(X,Y)$,
while the information contained simultaneously in both of
them (or the information they contain about each other) is
measured by $I(X;Y)$. One is then tempted to take as a
measure of their dissimilarity

$$\Delta_1(X,Y) = \Delta_1(S) = H(X,Y) - I(X;Y) = H(X|Y) + H(Y|X). \tag{50}$$

Indeed, this quantity (introduced by Shannon [34], and usually
referred to as the entropy metric [10]) satisfies the properties
of a pseudometric [10]. In a similar way one can show that
the following is also a pseudometric:

$$\Delta_\infty(X,Y) = \Delta_\infty(S) = \max\{H(X|Y), H(Y|X)\}, \tag{51}$$

as are the normalized variants of $\Delta_1$ and $\Delta_\infty$ [8]. These
pseudometrics have found numerous applications (see for
example [40]) and have also been considered in an algorithmic
setting 0. 

Remark 1: $\Delta_1$ is a pseudometric on the space of random
variables over the same probability space. Namely, for $\Delta_1$
to be defined, the joint distribution of (X, Y) must be given
because joint entropy and mutual information are not defined
otherwise. Equation (55) below defines the distance between
random variables (more precisely, between their distributions)
that does not depend on the joint distribution.

One can further generalize these definitions to obtain a
family of pseudometrics. This generalization is akin to the
familiar $\ell_p$ distances. Let

$$\Delta_p(X,Y) = \Delta_p(S) = \big( H(X|Y)^p + H(Y|X)^p \big)^{1/p} \tag{52}$$

for $p \ge 1$. Observe that $\lim_{p \to \infty} \Delta_p(X,Y) = \Delta_\infty(X,Y)$,
justifying the notation.

Theorem 16: $\Delta_p(X,Y)$ satisfies the properties of a
pseudometric, for all $p \in [1, \infty]$.

Proof: Nonnegativity and symmetry are clear, as is the
fact that $\Delta_p(X,Y) = 0$ if (but not only if) $X = Y$ with
probability one. The triangle inequality remains. Following the
proof for $\Delta_1$ from [10, Lemma 3.7], we first observe that
$H(X|Y) \le H(X|Z) + H(Z|Y)$, wherefrom:

$$\Delta_p(X,Y) \le \Big( \big( H(X|Z) + H(Z|Y) \big)^p + \big( H(Y|Z) + H(Z|X) \big)^p \Big)^{1/p}. \tag{53}$$

Now apply the Minkowski inequality ($\|a + b\|_p \le \|a\|_p +
\|b\|_p$) to the vectors $a = (H(X|Z), H(Z|X))$ and
$b = (H(Z|Y), H(Y|Z))$ to get:

$$\Delta_p(X,Y) \le \Delta_p(X,Z) + \Delta_p(Z,Y), \tag{54}$$

which was to be shown. ■
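
The triangle inequality (54) can be spot-checked numerically on a randomly generated joint distribution of three variables. The sketch below is only a sanity check under assumptions made here: entropies are in bits, the joint law is a dictionary from outcome triples to masses, `marginal` and `delta_p` are hypothetical helper names, and conditional entropies are computed as differences of joint and marginal entropies.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(W, axes):
    """Marginal of the joint law W (dict: outcome tuple -> mass) on given axes."""
    out = {}
    for outcome, mass in W.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + mass
    return out

def delta_p(W, a, b, p):
    """Delta_p between coordinates a and b; p = math.inf gives the max form."""
    h_ab = entropy(marginal(W, (a, b)).values())
    # Clamp tiny negative values caused by floating-point rounding.
    h_a_given_b = max(0.0, h_ab - entropy(marginal(W, (b,)).values()))
    h_b_given_a = max(0.0, h_ab - entropy(marginal(W, (a,)).values()))
    if p == math.inf:
        return max(h_a_given_b, h_b_given_a)
    return (h_a_given_b ** p + h_b_given_a ** p) ** (1.0 / p)

random.seed(0)
weights = {(x, y, z): random.random()
           for x in range(3) for y in range(3) for z in range(3)}
total = sum(weights.values())
W = {outcome: w / total for outcome, w in weights.items()}  # law of (X, Y, Z)

for p in (1, 2, 3.5, math.inf):
    lhs = delta_p(W, 0, 1, p)
    rhs = delta_p(W, 0, 2, p) + delta_p(W, 2, 1, p)
    print(p, lhs <= rhs + 1e-12)
```

A check of this kind cannot replace the proof, but it is a quick way to catch sign or indexing mistakes when implementing these dissimilarity measures.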
Having defined measures of dissimilarity, we can now define
the corresponding distances:

$$\Delta_p(P,Q) = \inf_{S \in C(P,Q)} \Delta_p(S). \tag{55}$$

The case $p = 1$ has also been analyzed in some detail in
[37], motivated by the problem of optimal order reduction for
stochastic processes.

Theorem 17: $\Delta_p$ is a pseudometric on $\Gamma^{(1)}$, for any
$p \in [1, \infty]$.

2 Drawing a familiar information-theoretic Venn diagram [9] makes it clear
that the quantity in (50) is a measure of "dissimilarity" of two random variables.

Proof: Since $\Delta_p$ satisfies the properties of a pseudometric,
we only need to show that these properties are preserved
under the infimum. 1° Nonnegativity is clearly preserved,
$\underline{\Delta}_p \ge 0$. 2° Symmetry is also preserved,
$\underline{\Delta}_p(P,Q) = \underline{\Delta}_p(Q,P)$. 3° If $P = Q$ then
$\underline{\Delta}_p(P,Q) = 0$. This is because
$S = \mathrm{diag}(P)$ (the distribution with masses $p_i = q_i$ on the diagonal
and zeroes elsewhere) belongs to $\mathcal{C}(P,Q)$ in this case, and
for this distribution we have $H_{X|Y}(S) = H_{Y|X}(S) = 0$.
4° The triangle inequality is left. Let $X$, $Y$ and $Z$ be
random variables with distributions $P$, $Q$ and $R$, respectively,
and let their joint distribution be specified. We know that
$\Delta_p(X,Y) \le \Delta_p(X,Z) + \Delta_p(Z,Y)$, and we have to prove
that

$$\inf_{\mathcal{C}(P,Q)} \Delta_p(X,Y) \le \inf_{\mathcal{C}(P,R)} \Delta_p(X,Z) + \inf_{\mathcal{C}(R,Q)} \Delta_p(Z,Y). \tag{56}$$

Since, from the above,

$$\inf_{\mathcal{C}(P,Q)} \Delta_p(X,Y) = \inf_{\mathcal{C}(P,Q,R)} \Delta_p(X,Y) \le \inf_{\mathcal{C}(P,Q,R)} \left\{ \Delta_p(X,Z) + \Delta_p(Z,Y) \right\}, \tag{57}$$



it suffices to show that

$$\inf_{\mathcal{C}(P,Q,R)} \left\{ \Delta_p(X,Z) + \Delta_p(Z,Y) \right\} = \inf_{\mathcal{C}(P,R)} \Delta_p(X,Z) + \inf_{\mathcal{C}(R,Q)} \Delta_p(Z,Y). \tag{58}$$



($\mathcal{C}(P,Q,R)$ denotes the set of all three-dimensional distributions
with one-dimensional marginals $P$, $Q$, and $R$, as
the notation suggests.) Let $T \in \mathcal{C}(P,R)$ and $U \in \mathcal{C}(R,Q)$
be the optimizing distributions on the right-hand side (rhs)
of (58). Observe that there must exist a joint distribution
$W \in \mathcal{C}(P,Q,R)$ consistent with $T$ and $U$ (for example, take
$w_{i,j,k} = t_{i,k} u_{k,j} / r_k$). Since the optimal value of the lhs is
less than or equal to the value at $W$, we have shown that the
lhs of (58) is less than or equal to the rhs. For the opposite
inequality observe that the optimizing distribution on the lhs
of (58) defines some two-dimensional marginals $T \in \mathcal{C}(P,R)$
and $U \in \mathcal{C}(R,Q)$, and the optimal value of the rhs must be
less than or equal to its value at $(T, U)$. ■
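The gluing step used in the proof, $w_{i,j,k} = t_{i,k} u_{k,j} / r_k$, can be checked mechanically. The sketch below is only an illustration: the function name `glue` and the matrices `T` and `U` are hypothetical, and exact rational arithmetic is used so that the marginal checks hold exactly.

```python
from fractions import Fraction as F

def glue(T, U):
    """Build W[i][j][k] = T[i][k] * U[k][j] / r[k] from T in C(P,R) and U in C(R,Q).

    T is indexed by (x, z) and U by (z, y); r is the common marginal of Z.
    """
    n, l, m = len(T), len(T[0]), len(U[0])
    r = [sum(T[i][k] for i in range(n)) for k in range(l)]
    return [[[T[i][k] * U[k][j] / r[k] if r[k] > 0 else F(0)
              for k in range(l)]
             for j in range(m)]
            for i in range(n)]

# Hypothetical example with P = (1/2, 1/2), R = (1/4, 3/4), Q = (1/2, 1/2).
T = [[F(1, 4), F(1, 4)],
     [F(0),    F(1, 2)]]
U = [[F(1, 4), F(0)],
     [F(1, 4), F(1, 2)]]
W = glue(T, U)

# Summing out Y recovers T, and summing out X recovers U.
xz = [[sum(W[i][j][k] for j in range(2)) for k in range(2)] for i in range(2)]
zy = [[sum(W[i][j][k] for i in range(2)) for j in range(2)] for k in range(2)]
print(xz == T, zy == U)
```

Under this construction $X$ and $Y$ are conditionally independent given $Z$, which is one convenient choice among the many joint distributions consistent with $T$ and $U$.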

Remark 2: If $\underline{\Delta}_p(P,Q) = 0$, then $P$ and $Q$ are a
permutation of each other. This is easy to see because only in
that case can one have $H_{X|Y}(S) = H_{Y|X}(S) = 0$, for some
$S \in \mathcal{C}(P,Q)$. Therefore, if distributions are identified up to a
permutation, then $\underline{\Delta}_p$ is a metric. In other words, if we think
of distributions as unordered multisets of nonnegative numbers
summing up to one, then $\underline{\Delta}_p$ is a metric on such a space.

Observe that the distribution defining $\underline{\Delta}_1(P,Q)$ is in fact the
minimum entropy coupling. Thus minimum entropy coupling
defines the distances $\underline{\Delta}_p$ on the space of probability
distributions in the same way maximal coupling defines the total
variation distance. However, there is a sharp difference in the
computational complexity of finding these two couplings, as
will be shown in the following section.

A. Some properties of entropy metrics 

We first note that $\underline{\Delta}_p$ is a monotonically nonincreasing
function of $p$. In the following, we shall mostly deal with $\underline{\Delta}_1$
and $\underline{\Delta}_\infty$, but most results concerning bounds and convergence
can be extended to all $\underline{\Delta}_p$ based on this monotonicity property.

The metric $\underline{\Delta}_1$ gives an upper bound on the entropy
difference $|H(P) - H(Q)|$. Namely, since

$$|H(X) - H(Y)| = |H(X|Y) - H(Y|X)| \le H(X|Y) + H(Y|X) = \Delta_1(X,Y), \tag{59}$$

we conclude that:

$$|H(P) - H(Q)| \le \underline{\Delta}_1(P,Q). \tag{60}$$

Therefore, entropy is continuous with respect to this pseudometric,
i.e., $\underline{\Delta}_1(P_n, P) \to 0$ implies $H(P_n) \to H(P)$.
Bounding the entropy difference is an important problem in
various contexts and it has been studied extensively, see for
example [18], [33]. In particular, [33] studies bounds on the
entropy difference via maximal couplings, whereas (60) is
obtained via minimum entropy couplings.

Another useful property, relating the entropy metric $\underline{\Delta}_1$ and
the total variation distance, follows from Fano's inequality:

$$H(X|Y) \le P(X \ne Y) \log(|X| - 1) + h(P(X \ne Y)), \tag{61}$$

where $|X|$ denotes the size of the support of $X$, and
$h(x) = -x \log_2(x) - (1 - x) \log_2(1 - x)$, $x \in [0, 1]$, is the binary
entropy function. Evaluating the rhs at the maximal coupling
(the joint distribution which minimizes $P(X \ne Y)$), and the
lhs at the minimum entropy coupling, we obtain:

$$\underline{\Delta}_1(P,Q) \le d_V(P,Q) \log(|P||Q|) + 2h(d_V(P,Q)). \tag{62}$$

This relation makes sense only when the alphabets (supports
of $P$ and $Q$) are finite. When the supports are also fixed it
shows that $\underline{\Delta}_1$ is continuous with respect to $d_V$, i.e., that
$d_V(P_n, P) \to 0$ implies $\underline{\Delta}_1(P_n, P) \to 0$. By Pinsker's
inequality [10] it then follows that $\underline{\Delta}_1$ is also continuous with
respect to information divergence, i.e., $D(P_n \| P) \to 0$ implies
$\underline{\Delta}_1(P_n, P) \to 0$.
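To make the role of the maximal coupling concrete, the sketch below builds one standard maximal coupling (diagonal masses $\min\{p_i, q_i\}$, with the leftover mass spread off the diagonal) and compares $\Delta_1$ at that coupling with the right-hand side of (62). This is an illustration under stated assumptions, not the construction used in the paper: all function names and the example distributions are hypothetical, logarithms are taken in base 2, and $P$ and $Q$ are assumed to live on the same finite index set.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def total_variation(P, Q):
    """d_V(P, Q) for two distributions on the same index set."""
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def maximal_coupling(P, Q):
    """A coupling S of P and Q with off-diagonal mass equal to d_V(P, Q)."""
    n, d = len(P), total_variation(P, Q)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        S[i][i] = min(P[i], Q[i])        # keep as much mass as possible on the diagonal
    if d > 0:
        for i in range(n):
            for j in range(n):
                # excess of P and excess of Q never occur at the same index,
                # so this term vanishes on the diagonal
                S[i][j] += max(P[i] - Q[i], 0.0) * max(Q[j] - P[j], 0.0) / d
    return S

def delta_1(S):
    """H(X|Y) + H(Y|X) in bits for a joint matrix S."""
    rows = [sum(r) for r in S]
    cols = [sum(c) for c in zip(*S)]
    joint = entropy(s for r in S for s in r)
    return (joint - entropy(cols)) + (joint - entropy(rows))

def rhs_of_62(P, Q):
    """d_V * log2(|P||Q|) + 2 h(d_V), with |.| the support size."""
    d = total_variation(P, Q)
    size = lambda D: sum(1 for x in D if x > 0)
    return d * math.log2(size(P) * size(Q)) + 2 * entropy([d, 1.0 - d])

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
S = maximal_coupling(P, Q)
print(delta_1(S), rhs_of_62(P, Q))
```

Since $\underline{\Delta}_1(P,Q)$ is an infimum over all couplings, the value of $\Delta_1$ at the maximal coupling is an upper bound on it; by Fano's inequality that value itself stays below the right-hand side of (62).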

The continuity of $\underline{\Delta}_1$ with respect to $d_V$ fails in the case
of infinite (or even finite, but unbounded) supports, which
follows from (60) and the fact that entropy is a discontinuous
functional with respect to the total variation distance. One can,
however, claim the following.

Proposition 18: If $P_n \to P$ in the total variation distance,
and $H(P_n) \to H(P) < \infty$, then $\underline{\Delta}_1(P_n, P) \to 0$.

Proof: In [16, Thm 17] it is shown that if
$d_V(P_{X_n}, P_X) \to 0$ and $H(X_n) \to H(X) < \infty$, then
$P(X_n \ne Y_n) \to 0$ implies $H(X_n|Y_n) \to 0$, for any r.v.'s
$Y_n$. Our claim then follows by specifying $P_{X_n} = P_n$,
$P_X = P_{Y_n} = P$, and taking infimums on both sides of the
implication. ■

We also note here that sharper bounds than the above can
be obtained by using $\underline{\Delta}_\infty$ instead of $\underline{\Delta}_1$. For example:

$$|H(P) - H(Q)| \le \underline{\Delta}_\infty(P,Q), \tag{63}$$

(with equality whenever the minimum entropy coupling of $P$
and $Q$ is such that $Y$ is a function of $X$, or vice versa), and:

$$\underline{\Delta}_\infty(P,Q) \le d_V(P,Q) \log\max\{|P|, |Q|\} + h(d_V(P,Q)). \tag{64}$$



We conclude this section with an interesting remark on the
conditional entropy. First observe that the pseudometric $\Delta_p$
($\underline{\Delta}_p$) can also be defined for random vectors (multivariate
distributions). For example, $\Delta_1((X, Y), (Z))$ is well-defined
by $H(X,Y|Z) + H(Z|X,Y)$. If the distributions of $(X, Y)$
and $Z$ are $S$ and $R$, respectively, then minimizing the above
expression over all tri-variate distributions with the
corresponding marginals $S$ and $R$ would give $\underline{\Delta}_1(S, R)$.
Furthermore, random vectors need not be disjoint. For example, we
have:

$$\Delta_1((X), (X,Y)) = H(X|X,Y) + H(X,Y|X) = H(Y|X), \tag{65}$$

because the first summand is equal to zero. Therefore, the
conditional entropy $H(Y|X)$ can be seen as the distance
between the pair $(X, Y)$ and the conditioning random variable
$X$. If the distribution of $(X, Y)$ is $S$, and the marginal
distribution of $X$ is $P$, then:

$$\underline{\Delta}_1(P,S) = H_{Y|X}(S), \tag{66}$$

because $S$ is the only distribution consistent with these
constraints. In fact, we have $\underline{\Delta}_p(P, S) = H_{Y|X}(S)$ for all
$p \in [1, \infty]$. Therefore, the conditional entropy $H(Y|X)$ represents
the distance between the joint distribution of $(X, Y)$ and the
marginal distribution of the conditioning random variable $X$.

VI. Optimization problems 

In this final section we analyze some natural optimization
problems associated with information measures over $\mathcal{C}(P,Q)$
and $\mathcal{C}(P,m)$, and establish their computational intractability.
The proofs are not difficult, but they have a number
of important consequences, as discussed in Section VI-C
and, furthermore, they give interesting information-theoretic
interpretations of well-known problems in complexity theory,
such as the SUBSET SUM and the PARTITION problems.
Some closely related problems over $\mathcal{C}(P,Q)$, in the context
of computing $\underline{\Delta}_1(P,Q)$, are also studied in [37].

A. Optimization over C(P, Q) 

Consider the following computational problem, called MINIMUM
ENTROPY COUPLING: Given $P = (p_1, \dots, p_n)$ and
$Q = (q_1, \dots, q_m)$ (with $p_i, q_j \in \mathbb{Q}$), find the minimum entropy
coupling of $P$ and $Q$. It is shown below that this problem is
NP-hard. The proof relies on the following well-known
NP-complete problem [12]:

Problem: Subset sum

Instance: Positive integers $d_1, \dots, d_n$ and $s$.

Question: Is there a $J \subseteq \{1, \dots, n\}$ such that
$\sum_{j \in J} d_j = s$?

Theorem 19: MINIMUM ENTROPY COUPLING is NP-hard.
Proof: We shall demonstrate a reduction from SUBSET SUM
to MINIMUM ENTROPY COUPLING. Let there
be given an instance of SUBSET SUM, i.e., a set of
positive integers $s; d_1, \dots, d_n$, $n \ge 2$. Let $D = \sum_{i=1}^{n} d_i$
and let $p_i = d_i/D$, $q = s/D$ (assume that $s < D$, the
problem otherwise being trivial). Denote $P = (p_1, \dots, p_n)$
and $Q = (q, 1 - q)$. The question we are trying to answer is
whether there is a $J \subseteq \{1, \dots, n\}$ such that $\sum_{j \in J} d_j = s$, i.e.,
such that $\sum_{j \in J} p_j = q$. Observe that this happens if and only
if there is a matrix $S$ with row sums $P = (p_1, \dots, p_n)$ and
column sums $Q = (q, 1 - q)$, which has exactly one nonzero
entry in every row (or, in probabilistic language, a distribution
$S \in \mathcal{C}(P, Q)$ such that $Y$ deterministically depends on $X$). We
know that in this case, and only in this case, the entropy of $S$
would be equal to $H(P)$, which, as established earlier, is a lower bound
on entropy over $\mathcal{C}(P, Q)$. In other words, if such a distribution
exists, it must be the minimum entropy coupling. Therefore, if
we could find the minimum entropy coupling, we could easily
decide whether it has one nonzero entry in every row, thereby
solving the given instance of SUBSET SUM. ■
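The reduction in the proof can be sketched in a few lines. The code below maps a SUBSET SUM instance to the pair $(P, Q)$ and then searches, by brute force and hence in exponential time, for a coupling with one nonzero entry per row. It is an illustration of the equivalence only, not an efficient algorithm; the function names and the example instance are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def subset_sum_to_coupling_instance(d, s):
    """Map (d_1..d_n, s) to P = (d_i / D) and Q = (s / D, 1 - s / D)."""
    D = sum(d)
    P = [Fraction(di, D) for di in d]
    q = Fraction(s, D)
    return P, [q, 1 - q]

def deterministic_coupling(P, Q):
    """Exhaustive search for S in C(P, Q) with exactly one nonzero entry per row.

    Returns the n x 2 matrix S, or None if no such coupling exists.
    """
    n = len(P)
    for size in range(n + 1):
        for J in combinations(range(n), size):
            if sum(P[j] for j in J) == Q[0]:
                # rows in J put their mass in column 0, the others in column 1
                return [[P[i], 0] if i in J else [0, P[i]] for i in range(n)]
    return None

# Hypothetical instance: 5 + 8 = 13, so a deterministic coupling exists.
P, Q = subset_sum_to_coupling_instance([3, 5, 8, 11], 13)
S = deterministic_coupling(P, Q)
print(S is not None)

# No subset of {3, 5, 8, 11} sums to 4, so none exists here.
P2, Q2 = subset_sum_to_coupling_instance([3, 5, 8, 11], 4)
print(deterministic_coupling(P2, Q2) is None)
```

Exact rationals are used so that the equality test on sums is not affected by floating-point rounding.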

Now, since $H(X)$ and $H(Y)$ are fixed over $\mathcal{C}(P,Q)$, we conclude
that the problems of minimization of the conditional entropies and
maximization of the mutual information over $\mathcal{C}(P,Q)$ are also NP-hard.
Furthermore, in the same way as above one can define the
problem MINIMUM $\alpha$-ENTROPY COUPLING, for any $\alpha > 0$,
and establish its NP-hardness. Note that the reverse problems
over $\mathcal{C}(P, Q)$, entropy maximization for example, are trivial for
Shannon information measures. In the case of Renyi entropy
the problem is in general not trivial, but it can be solved by
standard convex optimization methods.

It would be interesting to determine whether MINIMUM
ENTROPY COUPLING belongs to FNP (see footnote 3), but this appears to be
quite difficult. Namely, given the optimal solution, it is not
obvious how to verify (in polynomial time) that it is indeed
optimal. A similar situation arises with the decision version
of this problem: Given $P$ and $Q$ and a threshold $h$, is there
a distribution $S \in \mathcal{C}(P, Q)$ with entropy $H(S) \le h$? Whether
this problem belongs to NP is another interesting question
(which we shall not be able to answer here). The trouble with
these computational problems is that real-valued functions are
involved. Verifying, for example, that $H(S) \le h$ might not be
as computationally trivial as it seems, because the numbers
involved are in general irrational. We shall not go into these
details further; we mention instead one closely related problem
which has been studied in the literature:

Problem: Sqrt sum

Instance: Positive integers $d_1, \dots, d_n$, and $k$.
Question: Is $\sum_{i=1}^{n} \sqrt{d_i} \le k$?

This problem, though "conceptually simple" and bearing a
certain resemblance to the above decision version of the
entropy minimization problem, is not known to be solvable in
NP [13] (it is solvable in PSPACE).

B. Optimization over C(P, m)

Minimization of the joint entropy $H(X, Y)$ over $\mathcal{C}(P, m)$ is
trivial. The reason is that $H(X, Y) \ge H(P)$ with equality iff
$Y$ deterministically depends on $X$, and so the solution is any
joint distribution having at most one nonzero entry in each row
(the same is true for $H_\alpha$, $\alpha > 0$). Since $H(X)$ is fixed, this
also minimizes the conditional entropy $H(Y|X)$. The other

3 The class FNP captures the complexity of function problems associated
with decision problems in NP, see [28].



two optimization problems considered so far, minimization of
$H(X|Y)$ and maximization of $I(X;Y)$, are still equivalent
because $I(X;Y) = H(X) - H(X|Y)$, but they turn out to
be much harder. Therefore, in the following we shall consider
only the maximization of $I(X;Y)$.

Let OPTIMAL CHANNEL be the following computational
problem: Given $P = (p_1, \dots, p_n)$ and $m$ (with $p_i \in \mathbb{Q}$,
$m \in \mathbb{N}$), find the distribution $S \in \mathcal{C}(P,m)$ which maximizes the
mutual information. This problem is the reverse of the channel
capacity in the sense that now the input distribution (the
distribution of the source) is fixed, and the maximization is
over the conditional distributions. In other words, given a
source, we are asking for the channel with a given number of
outputs which has the largest mutual information. Since the
mutual information is convex in the conditional distribution
[9], this is again a convex maximization problem.

We describe next the well-known PARTITION (or NUMBER
PARTITIONING) problem [12].

Problem: Partition
Instance: Positive integers $d_1, \dots, d_n$.
Question: Is there a partition of $\{d_1, \dots, d_n\}$ into two
subsets with equal sums?

This is clearly a special case of SUBSET SUM. It can be
solved in pseudo-polynomial time by dynamic programming
methods [12]. But the following closely related problem is
much harder.

Problem: 3-Partition

Instance: Nonnegative integers $d_1, \dots, d_{3m}$ and $k$ with
$k/4 < d_j < k/2$ and $\sum_{j} d_j = mk$.

Question: Is there a partition of $\{1, \dots, 3m\}$ into
$m$ subsets $J_1, \dots, J_m$ (disjoint and covering
$\{1, \dots, 3m\}$) such that the sums $\sum_{j \in J_i} d_j$ are all
equal? (The sums are necessarily $k$ and every
$J_i$ has 3 elements.)

This problem is NP-complete in the strong sense [12], i.e., no
pseudo-polynomial time algorithm for it exists unless P=NP.

Theorem 20: OPTIMAL CHANNEL is NP-hard. 

Proof: We prove the claim by reducing 3-PARTITION to
OPTIMAL CHANNEL. Let there be given an instance of the
3-PARTITION problem as described above, and let $p_i = d_i/D$,
where $D = \sum_{i} d_i$. Deciding whether there exists a partition
with the described properties is equivalent to deciding whether
there is a matrix $C \in \mathcal{C}(P,m)$ with the other marginal
$Q$ being uniform and $C$ having at most one nonzero entry
in every row (i.e., $Y$ deterministically depending on $X$).
This on the other hand happens if and only if there is a
distribution $C \in \mathcal{C}(P,m)$ with mutual information equal to
$H(Q) = \log m$, which is an upper bound on $I_{X;Y}$ over
$\mathcal{C}(P,m)$ because $I(X;Y) \le H(Y) \le \log m$. The distribution $C$
would therefore necessarily be
the maximizer of $I_{X;Y}$. To conclude, if we could solve the
OPTIMAL CHANNEL problem with instance $(p_1, \dots, p_{3m}; m)$,
we could easily decide whether the maximizer is such that it
has at most one nonzero entry in every row, thereby solving
the original instance of the 3-PARTITION problem. ■

Note that the problem remains NP-hard even when the
number of channel outputs ($m$) is fixed in advance and is
not a part of the input instance. For example, maximization of
$I_{X;Y}$ over $\mathcal{C}(P, 2)$ is essentially equivalent to the PARTITION
problem.
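The connection with PARTITION can be illustrated by exhaustive search over deterministic channels with $m$ outputs. The sketch relies on the assumption, standard for convex maximization over a polytope, that a maximum of $I_{X;Y}$ over $\mathcal{C}(P,m)$ is attained at a vertex and that the vertices are exactly the deterministic channels, for which $I(X;Y) = H(Y)$. The function name and the example instances are hypothetical, and the search takes $m^n$ steps.

```python
import math
from itertools import product

def best_deterministic_channel(P, m):
    """Maximize I(X;Y) = H(Y) (in bits) over all maps from n inputs to m outputs."""
    best_val, best_map = -1.0, None
    for f in product(range(m), repeat=len(P)):
        Q = [0.0] * m
        for p, y in zip(P, f):
            Q[y] += p                    # output distribution induced by the map f
        val = -sum(q * math.log2(q) for q in Q if q > 0)
        if val > best_val:
            best_val, best_map = val, f
    return best_val, best_map

# Hypothetical PARTITION instance (1, 2, 3): {3} and {1, 2} have equal sums,
# so the maximum reaches log2(2) = 1 bit.
val_yes, f_yes = best_deterministic_channel([1 / 6, 2 / 6, 3 / 6], 2)

# Instance (1, 1, 3) has no equal-sum split, so the maximum stays below 1 bit.
val_no, f_no = best_deterministic_channel([0.2, 0.2, 0.6], 2)
print(val_yes, val_no)
```

The maximum equals $\log m$ exactly when the masses of $P$ can be grouped into $m$ blocks of equal weight, which for $m = 2$ is the PARTITION question.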

It is easy to see that the transformation in the proof of
Theorem 20 is in fact pseudo-polynomial [12], which implies
that OPTIMAL CHANNEL is strongly NP-hard and, unless
P=NP, has no pseudo-polynomial time algorithm.

C. Some comments and generalizations 

1) Entropy minimization: Entropy minimization, taken in
the broadest sense, is a very important problem. Watanabe
[38] has shown, for example, that many algorithms for
clustering and pattern recognition can be characterized as suitably
defined entropy minimization problems.

A much more familiar problem in information theory is
that of entropy maximization. The so-called Maximum entropy
principle formulated by Jaynes [20] states that, among all
probability distributions satisfying certain constraints
(expressing our knowledge about the system), one should pick the
one with maximum entropy. It has been recognized by Jaynes,
as well as many other researchers, that this choice gives the
least biased, the most objective distribution consistent with the
information one possesses about the system. Consequently, the
problem of maximizing entropy under constraints has been
thoroughly studied (see, e.g., [14], [21]). It has also been
argued [22], [41], however, that minimum entropy distributions
can be of as much interest as maximum entropy distributions.
The MinMax information measure, for example, has been
introduced in [22] as a measure of the amount of information
contained in a given set of constraints, and it is based both on
maximum and minimum entropy distributions.

One could formalize the problem of entropy minimization 
as follows: Given a polytope (by a system of inequalities with 
rational coefficients, say) in the set of probability distribu- 
tions, find the distribution S* which minimizes the entropy 
functional H. (If the coefficients are rational, then all the 
vertices are rational, i.e., have rational coordinates. Therefore, 
the minimum entropy distribution has finite description and 
is well-defined as an output of a computational problem.) 
This problem is strongly NP-hard and remains such over 
transportation polytopes, as established above. 

2) Renyi entropy minimization: The problem of minimization
of the Renyi entropies $H_\alpha$ over arbitrary polytopes is
also strongly NP-hard, for any $\alpha > 0$. Note that, for $\alpha > 1$,
this problem is equivalent to the maximization of the $\ell_\alpha$ norm
(see also [26], [6] for different proofs of the NP-hardness of
norm maximization). Interestingly, however, the minimization
of $H_\infty$ is polynomial-time solvable; it is equivalent to the
maximization of the $\ell_\infty$ norm [26]. For $\alpha < 1$, the minimization
of Renyi entropy is equivalent to the minimization of $\ell_\alpha$
(which is not a norm in the strict sense), a problem arising in
compressed sensing [13].
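The equivalences just stated can be read off from a one-line identity. The following is a short worked derivation, assuming the standard definition of the Renyi entropy of order $\alpha \notin \{0, 1\}$ and writing $\|P\|_\alpha = (\sum_i p_i^\alpha)^{1/\alpha}$:

```latex
\begin{align*}
H_\alpha(P) &= \frac{1}{1-\alpha} \log \sum_i p_i^\alpha
             = \frac{\alpha}{1-\alpha} \log \|P\|_\alpha , \\
\alpha > 1 &: \quad \frac{\alpha}{1-\alpha} < 0
   \;\Longrightarrow\; \arg\min H_\alpha = \arg\max \|P\|_\alpha , \\
0 < \alpha < 1 &: \quad \frac{\alpha}{1-\alpha} > 0
   \;\Longrightarrow\; \arg\min H_\alpha = \arg\min \|P\|_\alpha , \\
\alpha \to \infty &: \quad H_\infty(P) = -\log \max_i p_i = -\log \|P\|_\infty .
\end{align*}
```

In each case the logarithm is monotone, so the sign of the prefactor alone decides whether entropy minimization becomes a maximization or a minimization of $\|P\|_\alpha$.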

Hence, as we have seen throughout this section, various 
problems from computational complexity theory can be re- 
formulated as information-theoretic optimization problems. 
(Observe also the similarity of the SQRT SUM and the mini- 
mization of Renyi entropy of order 1/2.) 



3) Other information measures: Maximization of mutual
information is a very important problem in the general context.
The so-called Maximum mutual information criterion has
found many applications, e.g., for feature selection [2] and
the design of classifiers [15]. Another familiar example is that
of the capacity of a communication channel which is defined
precisely as the maximum of the mutual information between
the input and the output of a channel.

We have illustrated the general intractability of the problem
of maximization of $I_{X;Y}$ by exhibiting two simple classes of
polytopes over which the problem is strongly NP-hard (and we
have argued that the same holds for the conditional entropy).

We also mention here one possible generalization of this
problem: maximization of information divergence. Namely,
since for $S \in \mathcal{C}(P,Q)$:

$$I_{X;Y}(S) = D(S \| P \times Q), \tag{67}$$

one can naturally consider the more general problem of
maximization of $D(S \| T)$ when $S$ belongs to some convex region
and $T$ is fixed. Formally, let INFORMATION DIVERGENCE
MAXIMIZATION be the following computational problem:
Given a rational convex polytope $\Pi$ in the set of probability
distributions, and a distribution $T$, find the distribution $S \in \Pi$
which maximizes $D(\cdot \| T)$. This is again a convex maximization
problem because $D(S \| T)$ is strictly convex in $S$ [10].

Corollary 21: INFORMATION DIVERGENCE MAXIMIZA- 
TION is NP-hard. 

Note that the reverse problem, namely the minimization of
information divergence, defines an information projection of
$T$ onto the region $\Pi$ [10].

4) Measures of statistical dependence: We conclude this 
section with one more generalization of the problem of max- 
imization of mutual information. Namely, this problem can 
also be seen as a statistical problem of expressing the largest 
possible dependence between two given random variables. 

Consider the following statistical scenario. A system is 
described by two random variables (taking values in N) whose 
joint distribution is unknown; only some constraints that it 
must obey are given. The set of all distributions satisfying 
these constraints is usually called a statistical model. 

Example 1: Suppose we have two correlated information 
sources obtained by independent drawings from a discrete 
bivariate probability distribution, and suppose we only have 
access to individual streams of symbols (i.e., streams of 
symbols from either one of the sources, but not from both 
simultaneously) and can observe the relative frequencies of 
the symbols in each of the streams. We therefore "know" 
probability distributions of both sources (say P and Q), but we 
don't know how correlated they are. Then the "model" for this 
joint source would be $\mathcal{C}(P, Q)$. In the absence of any additional
information, we must assume that some $S \in \mathcal{C}(P,Q)$ is the
"true" distribution of the source.

Given such a model, we may ask the following question: 
What is the largest possible dependence of the two random 
variables? How correlated can they possibly be? This question 
can be made precise once a dependence measure is specified, 
and this is done next. 
A. Renyi [30] has formalized the notion of probabilistic
dependence by presenting axioms which a "good" dependence
measure $\rho$ should satisfy. These axioms, adapted for discrete
random variables, are listed below.

(A) $\rho(X, Y)$ is defined for any two random variables $X, Y$,
neither of which is constant with probability 1.

(B) $0 \le \rho(X, Y) \le 1$.

(C) $\rho(X, Y) = \rho(Y, X)$.

(D) $\rho(X, Y) = 0$ iff $X$ and $Y$ are independent.

(E) $\rho(X, Y) = 1$ iff $X = f(Y)$ or $Y = g(X)$.

(F) If $f$ and $g$ are injective functions, then $\rho(f(X), g(Y)) =
\rho(X, Y)$.

Actually, Renyi considered axiom (E) to be too restrictive and 
demanded only the "if part". It has been argued subsequently 
[01 , however, that this is a substantial weakening. We shall 
find it convenient to consider the stronger axiom given above. 
As an example of a good measure of dependence, one could
take precisely the mutual information; its normalized variant
$I(X;Y)/\min\{H(X), H(Y)\}$ satisfies all the above axioms.
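The normalized mutual information is easy to evaluate on small examples, which gives a quick sanity check of axioms (D) and (E). The sketch below is illustrative only: the function names and the three example joint distributions are hypothetical, and both marginals are assumed to be non-constant, as required by axiom (A), so that the denominator is positive.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def normalized_mi(S):
    """I(X;Y) / min{H(X), H(Y)} for a joint matrix S[i][j] = P(X=i, Y=j)."""
    rows = [sum(r) for r in S]           # marginal of X
    cols = [sum(c) for c in zip(*S)]     # marginal of Y
    joint = [s for r in S for s in r]
    mi = entropy(rows) + entropy(cols) - entropy(joint)
    return mi / min(entropy(rows), entropy(cols))

independent = [[0.25, 0.25],
               [0.25, 0.25]]             # X and Y independent fair bits
identical = [[0.5, 0.0],
             [0.0, 0.5]]                 # Y = X
function_of_x = [[0.25, 0.0],
                 [0.25, 0.0],
                 [0.0,  0.5]]            # Y = g(X) with g not injective

for S in (independent, identical, function_of_x):
    print(normalized_mi(S))
```

The first case gives the value 0 and the other two give 1, in line with axioms (D) and (E).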

We can now formalize the question asked above. Namely,
let MAXIMAL $\rho$-DEPENDENCE be the following problem:
Given two probability distributions $P = (p_1, \dots, p_n)$ and
$Q = (q_1, \dots, q_m)$, find the distribution $S \in \mathcal{C}(P,Q)$ which
maximizes $\rho$. The proof of the following claim is identical
to the one given for mutual information (entropy) in Section
VI-A and we shall therefore omit it.

Theorem 22: Let $\rho$ be a measure of dependence satisfying
Renyi's axioms. Then MAXIMAL $\rho$-DEPENDENCE is NP-hard.

The intractability of the problem over more general statis- 
tical models is now a simple consequence. 